A sparse PLS for variable selection when integrating omics data.
Abstract
Recent biotechnology advances allow multiple types of omics data, such as transcriptomic, proteomic or metabolomic data sets, to be integrated. The problem of feature selection has been addressed several times in the context of classification, but needs to be handled in a specific manner when integrating data. In this study, we focus on the integration of two-block data that are measured on the same samples. Our goal is to combine integration and simultaneous variable selection of the two data sets in a one-step procedure using a Partial Least Squares regression (PLS) variant to facilitate the biologists' interpretation. A novel computational methodology called "sparse PLS" is introduced for a predictive analysis to deal with these newly arisen problems. The sparsity of our approach is achieved with a Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition. Sparse PLS is shown to be effective and biologically meaningful. Comparisons with classical PLS are performed on a simulated data set and on real data sets. On one data set, a thorough biological interpretation of the obtained results is provided. We show that sparse PLS provides a valuable variable selection tool for high-dimensional data sets.
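To illustrate the idea of a Lasso penalty on the loading vectors of a Singular Value Decomposition, here is a minimal Python sketch of a single sparse PLS component. It is not taken from the paper: the abstract does not spell out the algorithm, so the sketch assumes a common formulation in which the first singular vectors of the cross-product matrix X^T Y are refined by alternating soft-thresholding and normalization. The function names and the parameters `lam_x` and `lam_y` are hypothetical, and the deflation step needed for further components is omitted.

```python
import numpy as np


def soft_threshold(a, lam):
    """Lasso soft-thresholding: shrink each entry toward zero by lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_pls_component(X, Y, lam_x, lam_y, n_iter=100, tol=1e-8):
    """Sketch of one sparse PLS component (assumed formulation).

    X is (n, p), Y is (n, q), both measured on the same n samples.
    lam_x and lam_y are hypothetical penalty levels for the X and Y
    loading vectors. Returns unit-norm sparse loading vectors (u, v).
    """
    # Center each variable so that X^T Y is a cross-covariance (up to scale).
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    M = X.T @ Y

    # Start from the first pair of singular vectors of M (classical PLS).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        # Penalized update of the X loadings, then renormalize.
        u_new = soft_threshold(M @ v, lam_x)
        norm_u = np.linalg.norm(u_new)
        if norm_u == 0.0:
            raise ValueError("lam_x is too large: all X loadings were zeroed")
        u_new /= norm_u

        # Penalized update of the Y loadings, then renormalize.
        v_new = soft_threshold(M.T @ u_new, lam_y)
        norm_v = np.linalg.norm(v_new)
        if norm_v == 0.0:
            raise ValueError("lam_y is too large: all Y loadings were zeroed")
        v_new /= norm_v

        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break

    return u, v
```

Variables whose entry in `u` or `v` is exactly zero are the ones dropped from the component. With `lam_x = lam_y = 0` the sketch reduces to the first singular vectors of X^T Y, which is the non-sparse starting point. In practice the penalty levels would be tuned, for example by cross-validation.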
Related articles
integrOmics: an R package to unravel relationships between two omics datasets
MOTIVATION With the availability of many 'omics' data, such as transcriptomics, proteomics or metabolomics, the integrative or joint analysis of multiple datasets from different technology platforms is becoming crucial to unravel the relationships between different biological functional levels. However, the development of such an analysis is a major computational and technical challenge as most...
Performance of PLS regression coefficients in selecting variables for each response of a multivariate PLS for omics-type data
Multivariate partial least square (PLS) regression allows the modeling of complex biological events, by considering different factors at the same time. It is unaffected by data collinearity, representing a valuable method for modeling high-dimensional biological data (as derived from genomics, proteomics and peptidomics). In presence of multiple responses, it is of particular interest how to ap...
ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data
4 Hands-on: 4.1 Loading; 4.2 Principal Component Analysis (PCA); 4.3 Partial least-squares: PLS and PLS-DA; 4.4 Orthogonal partial least square...
Group and sparse group partial least square approaches applied in genomics context
MOTIVATION The association between two blocks of 'omics' data brings challenging issues in computational biology due to their size and complexity. Here, we focus on a class of multivariate statistical methods called partial least square (PLS). Sparse version of PLS (sPLS) operates integration of two datasets while simultaneously selecting the contributing variables. However, these methods do no...
High dimensional classification with combined adaptive sparse PLS and logistic regression
Motivation The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which combined constitute a powerful framework for classification, as well as data visualization and interpretation. However, current proposed combinations le...
Journal: Statistical Applications in Genetics and Molecular Biology
Volume 7, Issue 1
Publication date: 2008